Parallel Programming with Microsoft Visual Studio 2010 : Introduction to Parallel Programming - Software Patterns

7/29/2011 5:13:46 PM

Parallel programming is not a new concept; it has been around for some time, although mostly on large or distributed systems. Parallel computing has more recently been available on personal computers with Threading Building Blocks (TBB), OpenMP, and other parallel solutions. So although parallel computing might seem new, the concepts are actually mature. For example, design patterns have been developed to help programmers design, architect, and implement a robust, correct, and scalable parallel application. The book Patterns for Parallel Programming by Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill (Addison-Wesley Professional, 2004) provides a comprehensive study on parallel patterns, along with a detailed explanation of the available design patterns and best practices for parallel programming. Another book, Parallel Programming with Microsoft .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures by Colin Campbell et al. (Microsoft Press, 2010) is an important resource for patterns and best practices that target the .NET Framework and TPL.

Developers on average do not write much unique code. Most code concepts have been written somewhere before. Software pattern engineers research this universal code base to isolate standard patterns and solutions for common problems in a domain space. You can use these patterns as the building blocks that form the foundation of a stable application. Around these core components, you add the unique code for your application, as illustrated in the following diagram. This approach not only results in a stable application but is also a highly efficient way to develop an application.

Design patterns should be an integral part of the software development life cycle of every application. These patterns require thorough knowledge of your problem domain. All object-oriented programs model a problem domain, and parallel applications are no exception. As applications become more complex, knowing the problem domain increases in importance.

Patterns for Parallel Programming defines four phases of parallel development:

Finding Concurrency
Algorithm Structures
Support Structures
Implementation Mechanisms

The first two phases are design and analysis, which include tasks such as finding exploitable concurrency. These phases are the precursors to actually writing code. Later, you map the analysis onto code by using the Support Structures and Implementation Mechanisms phases.You can consider the TPL as a generic implementation of this pattern; it maps parallel programming onto the .NET Framework.

I urge you to explore parallel design patterns so you can benefit from the hard work of other parallel applications developers.

1. The Finding Concurrency Pattern

The first phase is the most important. In this phase, you identify exploitable concurrency. The challenge involves more than identifying opportunities for concurrency, because not every potential concurrency is worth pursuing. The goal is to isolate opportunities of concurrency that are worth exploiting.

The Finding Concurrency pattern begins with a review of the problem domain. Your goal is to isolate tasks that are good candidates for parallel programming—or conversely, exclude those that are not good candidates. You must weigh the benefit of exposing specific operations as parallel tasks versus the cost. For example, the performance gain for parallelizing a for loop with a short operation might not offset the scheduling overhead and the cost of running the task.

When searching for potential parallel tasks, review extended blocks of compute-bound code first. This is where you will typically find the most intense processing, and therefore also the greatest potential benefit from parallel execution.

Next, you decompose exploitable concurrency into parallel tasks. You can decompose operations on either the code or data axis (Task Decomposition and Data Decomposition, respectively). The typical approach is to decompose operations into several units. It’s easier to load balance a large number of discrete tasks than a handful of longer tasks. In addition, tasks of relatively equal length are easier to load balance than tasks of widely disparate length.

1.1. The Task Decomposition Pattern

In the Task Decomposition pattern, you decompose code into separate parallel tasks that run independently, with minimal or no dependencies. For example, functions are often excellent candidates for refactoring as parallel tasks. In object-oriented programming, functions should do one thing. However, this is not always the case. For longer functions, evaluate whether the function performs more than one task. If so, you might be able to decompose the function into multiple discrete tasks, improving the opportunities for parallelism.

1.2. The Data Decomposition Pattern

In the Data Decomposition pattern, you decompose data collections, such as lists, stacks, and queues, into partitions for parallel processing. Loops that iterate over data collections are the best locations for decomposing tasks by using the Data Decomposition pattern. Each task is identical but is assigned to a different portion of the data collection. If the tasks have short durations, you should consider grouping multiple tasks together so that they execute as a chunk on a thread, to improve overall efficiency.

1.3. The Group Tasks Pattern

After completing the Task and Data Decomposition patterns, you will have a basket of tasks. The next two patterns identify relationships between these tasks. The Group Task pattern groups related tasks, whereas the Order Tasks pattern imposes an order to the execution of tasks.

You should consider grouping tasks in the following circumstances:

Group tasks together that must start at the same time. The Order Task pattern can then refer to this group to set that constraint.
Group tasks that contribute to the same calculation (reduction).
Group tasks that share the same operation, such as loop operation.
Group tasks that share a common resource, where simultaneous access is not thread safe.

The most important reason to create task groups is to place constraints on the entire group rather than on individual tasks.

1.4. The Order Tasks Pattern

The Order Tasks pattern is the second pattern that sets dependencies based on task relationships. This pattern identifies dependencies that place constraints on the order (the sequence) of task execution. In this pattern, you often reference groups of tasks defined in the Group Task pattern. For example, you might reference a group of tasks that must start together.

Do not overuse this pattern. Ordering implies synchronization at best, and sequential execution at worst.

Some example of order dependencies are:

Start dependency This is when tasks must start at the same time. Here the constraint is the start time.
Predecessor dependency This occurs when one task must start prior to another task.
Successor dependency This happens when a task is a continuation of another task.
Data dependency This is when a task cannot start until certain information is available.

1.5. The Data Sharing Pattern

Parallel tasks may access shared data, which can be a dependency between tasks. Proper management is essential for correctness and to avoid problems such as race conditions and data corruptions. The Data Sharing pattern describes various methods for managing shared data. The goals are to ensure that tasks adhering to this pattern are thread safe and that the application remains scalable.

When possible, tasks should consume thread-local data. Thread-local data is private to the task and not accessible from other tasks. Because of this isolation, thread-local data is exempt from most data-sharing constraints. However, tasks that use thread-local data might require shared data for consolidation, accumulation, or other types of reduction. Reduction is the consolidation of partial results from separate parallel operations into a single value. When the reduction is performed, access to the shared data must be coordinated through some mechanism, such as thread synchronization.

Sharing data is expensive. Proposed solutions to safely access shared data typically involve some sort of synchronization. The best solution for sharing data is not to share data. This includes copying the shared data to a thread-local variable. You can then access the data privately during a parallel operation. After the operation is complete, you can perform a replacement or some type of merge with the original shared data to minimize synchronization.

The type of data access can affect the level of synchronization. Common data access types are summarized here:

Read-only This is preferred and frequently requires no synchronization.
Write-only You must have a policy to handle contention on the write. Alternatively, you can protect the data with exclusive locks. However, this can be expensive. An example of write-only is initializing a data collection from independent values.
Read-write The key word here is the write. Copy the data to a thread-local variable. Perform the update operation. Write results to the shared data, which might require some level of synchronization. If more reads are expected than writes, you might want to implement a more optimistic data sharing model—for example, spinning instead of locking.
Reduction The shared data is an accumulator. Copy the shared data to a thread-local variable. You can then perform an operation to generate a partial result. A reduction task is responsible for applying partial results to some sort of accumulator. Synchronization is limited to the reduction method, which is more efficient. This approach can be used to calculate summations, averages, counts, maximum value, minimal value, and more.

2. The Algorithm Structure Pattern

The result of the Finding Concurrency phase is a list of tasks, dependencies, and constraints for a parallel application. The phase also involves grouping related tasks and setting criteria for ordering tasks. In the Algorithm Structure phase, you select the algorithms you will use to execute the tasks. These are the algorithms that you will eventually implement for the program domain.

The algorithms included in the Algorithm Structure pattern adhere to four principles. These algorithms must:

Make effective use of processors.
Be transparent and maintainable by others.
Be agnostic to hardware, operating system, and environment.
Be efficient and scalable.

As mentioned, algorithms are implementation-agnostic. You might have constructs and features in your environment that help with parallel development and performance. The Implementation Mechanisms phase describes how to implement parallel patterns in your specific environment.

The Algorithm Structure pattern introduces several patterns based on algorithms:

Task Parallelism Pattern Arrange tasks to run efficiently as independent parallel operations. Actually, having slightly more tasks than processor cores is preferable—especially for input/output bound tasks. Input/output bound tasks might become blocked during input/output operations. When this occurs, extra tasks might be needed to keep additional processor cores busy.
Divide and Conquer Pattern Decompose a serial operation into parallel subtasks, each of which returns a partial solution. These partial solutions are then reintegrated to calculate a complete solution. Synchronization is required during the reintegration but not during the entire operation.
Geometric Decomposition Pattern Reduce a data collection into chunks that are assigned the same parallel operation. Larger chunks can be harder to load balance, whereas smaller chunks are better for load balancing but are less efficient relative to parallelization overhead.
Recursive Data Pattern Perform parallel operations on recursive data structures, such as trees and link lists.
Pipeline Pattern Apply a sequence of parallel operations to a shared collection or independent data. The operations are ordered to form a pipeline of tasks that are applied to a data source. Each task in the pipeline represents a phase. You should have enough phases to keep each processor busy. At the start and end of pipeline operations, the pipeline might not be full. Otherwise, the pipeline is full with tasks and maximizes processor utilization.

3. The Supporting Structures Pattern

The Supporting Structures pattern describes several ways to organize the implementation of parallel code. Fortunately, several of these patterns are already implemented in the TPL as part of the .NET Framework. For example, the .NET Framework 4 thread pool is one implementation of the Master/Worker pattern.

There are four Supporting Structures patterns:

SPMD (Single Program/Multiple Data) A single parallel operation is applied to multiple data sequences. In a parallel program, the processor cores often execute the same task on a collection of data.
Master/Worker The process (master) sets up a pool of executable units (workers), such as threads, that execute concurrently. There is also a collection of tasks whose execution is pending. Tasks are scheduled to run in parallel on available workers. In this manner, the workload can be balanced across multiple processors. The .NET Framework 4 thread pool provides an implementation of this pattern.
Loop Parallelism Iterations of a sequential loop are converted into separate parallel operations. Resolving dependencies between loop iterations is one of the challenges. Such dependencies were perhaps inconsequential in sequential applications but are problematic in a parallel version. The .Net Framework 4 provides various solutions for loop parallelism, including Parallel.For, Parallel.ForEach, and PLINQ (Parallel Language Integration Query).
Fork/Join Work is decomposed into separate tasks that complete some portion of the work. A unit of execution, such as a thread, spawns the separate tasks and then waits for them to complete. This is the pattern for the Parallel.Invoke method in the TPL.

The Supporting Structure phase also involves patterns for sharing data between multiple parallel tasks: the Shared Data, Shared Queue, and Distributed Array patterns. These are also already implemented in the .NET Framework, available as collections in the System.Collections.Concurrent namespace.